Figure 1: Dynamic Optimization Method


We distinguish static and dynamic optimiza- 
tion of programs: whereas static optimiza- 
tion modifies a program before runtime and is 
based only on its syntactic structure, dynamic
optimization is based on the statistical prop- 
erties of the input source and examples of 
program execution. Explanation-based gen- 
eralization is a commonly used dynamic op- 
timization method, but its effectiveness as a 
speedup-learning method is limited, in part 
because it fails to separate the learning process
from the program transformation process. This paper
describes a dynamic optimization technique called a
learn-optimize cycle
cycle that first uses a learning element to 
uncover predictable patterns in the program 
execution and then uses an optimization al- 
gorithm to map these patterns into benefi- 
cial transformations. The technique has been 
used successfully for dynamic optimization of 
pure Prolog. 


1 Introduction 

Program “optimization” is the task of replacing a 
program (or planner, theorem-prover, etc.) by a se- 
mantically equivalent one with superior performance. 
“Semantic equivalence” means that both the original 
and the optimized programs compute the same in- 
put/output relation. Let us differentiate between two 
kinds of optimization: Static optimization methods ap- 
ply before program execution; following analysis of the 
local and global structure, syntactic forms are replaced 
by equivalents that are expected to perform as well or 
better, regardless of the input problem. Examples include
the familiar code-optimization methods for compilers.
Dynamic optimization, by contrast, uses experience
gained from the actual execution of the program
to improve its expected performance on subsequent
runs. Memoization, explanation-based generalization, 
and unfold/fold transformations are familiar methods 
of dynamic optimization. 

Currently most dynamic optimization methods are 
based on a “caching” model, whereby individual prob- 
lem solutions (possibly after being generalized some- 
what) are either remembered or discarded. A diffi- 
culty with caching techniques is that the learned in- 
formation may sometimes lead to program transfor- 
mations that ultimately degrade, rather than enhance, 
program performance — particularly with highly recur- 
sive programs. Various methods of utility analysis, 
e.g., (Gratch and Dejong, 1991; Markovitch and Scott, 
1989; Minton, 1989; Shavlik, 1990), have been pro- 
posed to address such problems as “expensive chunks” 
and “generalization-to-N”. The fundamental drawback,
however, is much deeper: by intertwining the
learning and the transformation processes, the learning
process may never converge, and as a result, finding
truly effective transformations can be difficult or
impossible.

The primary contribution of this paper is a new
method, called the learn-optimize cycle, for separating
the learning task from the optimization task. We
consider a model (Figure 1) in which both a source 
program P and a source S of problems are the in- 
put; the task is to construct an equivalent program P' 
whose expected performance is as good as that of P, 
and hopefully much better, on problems drawn from 
S. A learning element is used to observe a number of 
program executions. Naturally, the sample size must 
be large enough to detect statistically regular features 
in the examples. Then a separate algorithm (the “opti- 
mizer”) is used to construct a new program that, with 
high probability, performs as well or better on S. A 
series of such learning and optimizing passes is used
to find a (locally) optimal program. 

To test the method, I built a prototype of a dynamic 
program optimizer for Prolog programs. I first modi- 
fied a Prolog compiler so that compiled programs will 
pass information about their execution to a learning 
element called a TDAG. I then compiled the target 
program and collected data from a set of its execu- 
tions, drawing problems from a particular randomized 
problem generator. After the learning element had 
stabilized, I used the learned data to modify the pro- 
gram using two transformation techniques: clause re- 
ordering and unfolding. Both techniques are known to 
preserve the program semantics (Sterling and Shapiro, 
1986; Tamaki and Sato, 1984), so the optimized pro- 
gram was equivalent to the original. This learn- 
ing/optimization process was then repeated until the 
optimizer could find no more transformations. The 
performance of the resulting program was then com- 
pared to that of the original. Experimental results are 
summarized later. 

In this paper we first review related work and then de- 
scribe briefly the learning element (TDAG) and its ap- 
plication to learning Prolog execution properties. We 
then show how the optimizer uses the TDAG infor- 
mation to modify the source program. Experimental 
procedures and results are given, followed by a critique 
of this method with ideas for further development.

2 Related Work 

The objective of this research is to find a practical 
way to do speedup learning that rests more solidly 
on first principles than I have found in machine- 
learning papers. Recent papers, such as (Etzioni, 1990; 
Laird, 1991; Laird and Gamble, 1990a; Letovsky, 1990; 
Subramanian and Feldman, 1990), have shown how 
important the distribution and the order of the exam- 
ples is in EBG-based systems and how it affects the 
estimates of utility of learned rules. Etzioni, for ex- 
ample, noticed that on some domains Prodigy/EBL 
(Minton, 1989) was learning rules that could be found 
just as effectively, and more efficiently, using a static
learner.

Subramanian and Feldman (Subramanian and Feld- 
man, 1990) demonstrated that one could feasibly pre- 
dict the utility of certain transformations, and that 
unfoldings of only a few levels, instead of the EBG 
method of unfolding the entire solution, were worthy 
of study. The idea of incorporating costs and proba- 
bilities into the TDAG projections was inspired by a 
paper by Yamada and Tsuji (Yamada and Tsuji, 1990;
Yamada, 1992), whose analysis showed that online 
statistics could be used to avoid utility problems with- 
out the need to benchmark each change individually on 
a set of example problems as in the Prodigy system. 
In recent work (Segre et al., 1992) a combination of
optimization methods, including caches and dynamic
reordering, has been applied to the task of improving
the performance of an automated deduction system, 
with considerable success. 

A number of investigations of how to find an opti- 
mal ordering of conjunctive queries have pointed the 
way for several researchers to base reorderings on costs 
and probabilities. Smith and Genesereth (1985) is a 
good example, and Greiner (Greiner, 1989) builds on 
these results with some of his EBL-optimization work. 
Working with Orponen (Greiner and Orponen, 1991), 
he has developed this idea into a dynamic optimiza- 
tion algorithm for query databases — one that is truly 
“optimal” (in the sense of PAC learning), in contrast 
to this and other work where “optimize” is a misnomer 
for “improve.” To date, however, their analysis applies 
to a restricted database, and no implementation of the 
algorithm has been reported. 

The idea of using learning to devise program speedups 
and then to incorporate them into the original source 
program was stolen from PROLEARN (Prieditis and 
Mostow, 1987), one of the first dynamic program op- 
timizers and possibly the first to employ partial eval- 
uation techniques effectively. This approach stands 
in contrast to the practice of learning “search-control” 
rules per se (as in SOAR and Prodigy), without trying 
to convert them into program transformations.

Gooley and Wah (Gooley and Wah, 1989) investigated 
extensively the use of Markov models of execution in 
order to speed up Prolog programs. In their transformations
they reordered both the clauses and the subgoals
within a clause, using costs and probabilities to
determine the best ordering. Furthermore, their aim 
was to support optimization of true Prolog, including 
cuts and second-order constructs. Learning was not 
the focus of their work, nor did they attempt to choose 
different orderings at different places in the proof or to 
perform unfoldings, but their encouraging results provided
a number of good ideas. Indeed, this work can
be viewed as extending theirs to include clause unfolding
and admit reorderings based on context.

3 The TDAG

Our optimization method begins with a learning 
phase. Almost any algorithm that learns to predict 
sequences could be used; we developed the TDAG al- 
gorithm, however, expressly for this problem. An ef- 
ficient way to learn to predict sequences of discrete 
symbols, it has other applications and is discussed in 
detail elsewhere (Laird, 1992). Here we shall give only 
the main ideas as they apply to Prolog optimization. 

Consider an input source that generates a continual 
stream of symbols (e.g., “w q x b q s s a v s ...”),
and suppose we want to learn to predict the next sym- 
bol probabilistically (e.g., the next symbol will be q 
with probability 0.7 or x with probability 0.24, etc.). 
A possible approach is to model the input source as a 
Markov model; unfortunately no tractable method for 
learning a general family of Markov models is known 
(Abe and Warmuth, 1990; Laird, 1988). The TDAG 
approach, which draws on ideas from adaptive data 
compression (Bell et al., 1990), offers a practical compromise
that ameliorates many of the theoretical problems
associated with Markov models.

We proceed as follows. Let \(a_i\) be the i’th symbol to
arrive in the input stream. We define the set of suffixes
(at time i) to be the set of i strings \(a_1 a_2 \ldots a_i\),
\(a_2 \ldots a_i\), ..., \(a_{i-1} a_i\), and \(a_i\). We can keep a table in
which we count the occurrences of each suffix of the
input as each input symbol is observed. For example,
after observing the four input symbols “a b b b” we
would have a table containing a (1 occurrence), b (3
occurrences), ab (1), bb (2), abb (1), bbb (1), and abbb
(1).
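
The unpruned table described above can be sketched in a few lines of Python. This is only an illustration of the counting rule, not the TDAG implementation; the function name `suffix_counts` is hypothetical.

```python
from collections import Counter

def suffix_counts(symbols):
    """Count every suffix seen at every time step (the unpruned table)."""
    table = Counter()
    for i in range(1, len(symbols) + 1):
        # At time i the suffixes are symbols[j:i] for j = 0 .. i-1.
        for j in range(i):
            table["".join(symbols[j:i])] += 1
    return table

table = suffix_counts("abbb")
for suffix in sorted(table, key=lambda s: (len(s), s)):
    print(suffix, table[suffix])
```

The double loop makes the quadratic growth of the table explicit, which is the reason the text goes on to prune it.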

Of course, the size of a table of suffixes grows quite 
rapidly — potentially as the square of the number of in- 
put symbols — so maintaining such a table is not prac- 
tical. But such a table can be used to predict the 
probability of the next symbol, as follows. Suppose 
that we have seen 100 input symbols so far, and that 
the past three input characters were “...a b b”. As- 
sume the suffix abb appears in the table with a count of 
30 times and that the suffix abba occurs 6 times. This 
means that six times out of thirty, the sequence abb 
has been followed directly by the character a; hence 
we can estimate the likelihood of next seeing an a as 
6/30 if we base our prediction for the next character 
on the preceding three characters. We could also base 
the prediction on the preceding n characters, for any 
n from 0 to 99. 
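
As a rough sketch of the prediction rule just described, the estimate is the count of the extended suffix divided by the count of the context. The function name and the dictionary representation of the table are assumptions made for illustration.

```python
def next_symbol_estimate(table, context, symbol):
    """Estimate P(symbol | context) as count(context + symbol) / count(context)."""
    seen = table.get(context, 0)
    if seen == 0:
        return None  # context never observed, so no estimate is available
    return table.get(context + symbol, 0) / seen

# Counts taken from the example in the text: abb seen 30 times, abba 6 times.
counts = {"abb": 30, "abba": 6}
estimate = next_symbol_estimate(counts, "abb", "a")
print(estimate)
```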

Besides predicting the next symbol, we can also com- 
pute a prior probability for each suffix in the table. 
The prior probability of a suffix S of length k is the
probability that the next k input symbols will be S,
given no information about the preceding characters. 
In our example, abb has an estimated probability of 
30/97, since it has occurred 30 times out of a total of
100 - 3 = 97 suffixes of length 3.

This algorithm will remain impractical unless we limit
the growth of the suffix table. Let us do so by removing
all suffixes \(S \cdot x\) (where S is a string and x a
symbol) for which the prior probability of S is less than
some value \(\Theta\). The parameter \(\Theta\) is a positive fraction
chosen by the user on the basis of the amount of available
storage. In the preceding example, if \(\Theta < 30/97\),
the table entry abba will be kept; otherwise it will be
discarded. In typical cases the size of the table will
be limited to about \(O(A\Theta^{-1})\) entries, where A is the
number of distinct input symbols, although periodicities
in the input source can still cause the size of the
table to grow without limit. We must, therefore, also
impose an upper limit D on the length of any suffix
stored in the table, a limit that will rarely be reached
in practice.

This also suggests a reasonable way to decide which
suffix to use to predict the next character: use
the longest suffix S whose count is at least M and for
which there is at least one extension \(S \cdot x\) in the table,
where x is a symbol. The parameter M is chosen based
on the desired confidence in the prediction probability.
Other ways to formulate predictions are possible, but
this one is effective, simple, fast, and principled.
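
The selection rule can be sketched as a scan from the longest suffix of the recent input to the shortest. The function name, the flat dictionary table, and the treatment of the empty suffix as an always-eligible fallback are assumptions of this sketch; the source does not specify them.

```python
def choose_context(table, history, min_count, alphabet):
    """Return the longest suffix of `history` with count >= min_count
    that has at least one one-symbol extension in the table."""
    for start in range(len(history) + 1):
        suffix = history[start:]
        # Assumption: the empty suffix (the root) is never rejected on count.
        if suffix and table.get(suffix, 0) < min_count:
            continue
        if any(suffix + x in table for x in alphabet):
            return suffix
    return None

# Hypothetical counts, chosen only to exercise the rule.
demo_table = {"a": 20, "b": 40, "ba": 9, "bb": 12, "bba": 5, "abb": 3, "abba": 1}
print(choose_context(demo_table, "abb", 10, "ab"))
```

With M = 10 the context abb is rejected because it has been seen only 3 times, so the shorter context bb is used instead.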

We now have an algorithm that is nearly practical. 
The only additional improvement is to structure the 
table as a tree for efficiency. The root of this tree is 
labeled by the empty string; and if the suffix S and its
one-character extension \(S \cdot x\) (where S is a string and
x a character) are both in the table, then in the corresponding
tree there is an edge from the node labeled
S to the node labeled \(S \cdot x\). For example, if abb and abba
are entries in the table, then the node labeled abb will 
have among its children a node labeled abba; if there 
are no extensions of abb in the table, then abb occurs 
as a leaf in the tree. 

Luckily there are simple, efficient algorithms (not 
given here) for updating the tree as each new input 
symbol arrives and for predicting the next symbol 
probabilistically. We call the resulting tree a TDAG 
and refer to the algorithm as the TDAG learning ele- 
ment. It can be shown that the algorithm converges in 
the limit to a useful approximation of any Markovian 
input source. 

For dynamic optimization we shall need to learn from
Prolog proofs, which are trees, not strings. It is easy to
extend the TDAG to learn a class of tree-structured sequences
called multi-strings. Here each input symbol
x comes with a unique integer \(v \ge 0\) called its multiplicity,
which we indicate by writing \(x_v\). Formally, a
multi-string consists of a symbol \(x_v\) concatenated with
v multi-strings. For example, “\(a_2\, c_0\, b_1\, c_0\)” denotes
a multi-string in which the symbol a has multiplicity
two and thus is followed by two multi-strings: one consisting
of only the symbol c, and one consisting of the
multi-string “\(b_1\, c_0\)”. Multi-strings are most easily exhibited
as ordered trees in which the root node \(x_v\) has
as children v subtrees representing the v multi-strings
that follow x. Note that ordinary strings are just a
special case of multi-strings in which each symbol has
multiplicity one except the last, which has multiplicity
zero. Generalizing the TDAG to learn multi-strings is
easy: just as a string TDAG makes a prediction of
which symbols are most likely to follow the most recent
input symbol \(x_1\), a multi-string TDAG makes v predictions,
one for each of the successors of the most recent
input symbol \(x_v\). Converting a TDAG algorithm for
strings into one for multi-strings is a simple matter
of replacing some single-valued elements with arrays of
size v and using a stack to keep track of our depth in
the multi-string.
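
To make the definition concrete, the following sketch parses a multi-string, given in pre-order as (symbol, multiplicity) pairs, into an ordered tree. The representation and the name `parse_multistring` are hypothetical; recursion plays the role of the stack mentioned above.

```python
def parse_multistring(items):
    """Parse pre-order (symbol, multiplicity) pairs into (symbol, children)."""
    def parse(pos):
        symbol, multiplicity = items[pos]
        pos += 1
        children = []
        for _ in range(multiplicity):
            # Each of the v successors is itself a complete multi-string.
            child, pos = parse(pos)
            children.append(child)
        return (symbol, children), pos

    tree, end = parse(0)
    if end != len(items):
        raise ValueError("trailing symbols after a complete multi-string")
    return tree

# The example from the text: a has multiplicity 2, b has 1, each c has 0.
tree = parse_multistring([("a", 2), ("c", 0), ("b", 1), ("c", 0)])
print(tree)
```

An ordinary string such as "abc" corresponds to the input [("a", 1), ("b", 1), ("c", 0)], which parses to a single chain.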


4 Using a TDAG to learn Clause 
Sequences 

Logic programs represent search problems in which the
task is to find a clause [C]: H <- T1, T2, ... whose
head H unifies with the input goal and whose subgoals
Ti (after applying a unifying substitution) are
all refutable. If we can predict which clause should
be chosen for any given goal, then the cost of run- 
ning the program is linear in the size of the solution. 
Our intention is to use a TDAG to guide us to the 
right clauses during the proof. Also, unfolding part 
of a proof reduces the size of some solutions and po- 
tentially changes the search order. We want to use 
a TDAG to tell us which unfoldings will improve the 
average cost of solutions, not just the cost of a single 
solution. Other program transformations are possi- 
ble; we limited our research to these two since they 
preserve the semantics of the program, are frequently 
performed, and are relatively easy to understand. 

Refuting a goal G results in a proof tree (Sterling and
Shapiro, 1986) whose root is the goal G and whose children
are proof trees for each subgoal generated by a
resolution step. Given the proof tree one can easily derive
a clause-name tree, in which each node of the proof
tree is re-labeled with the name C of the clause used
to resolve the goal or subgoal. For example, in Figure
2, we show such a clause-name tree for the three-step
proof of the goal G = p(f(a)) using a program which
will serve as a running example throughout this paper.

The key observation is that a clause-name tree is a 
multi-string; therefore sequences of clause-name trees 
can be learned using a TDAG. Each clause C has a 
fixed number v of terms in its tail; thus each occur- 
rence of C in the clause-name tree has v subtrees whose 
root nodes are labeled by the names of the clauses 
used to resolve the subgoals. Thus the number v of 
antecedents in the body of the clause C is its multi- 
plicity. 


The basic idea is that, by learning from a sequence 
of clause-name trees, we simultaneously learn to pre- 
dict which clauses will succeed at different points in 
the proof. In order to improve program performance, 
however, both the likelihood of success and the expected 
cost of the effort need to be estimated. Consequently 
we shall gather cost information as well as likelihoods 
in our TDAG. 

The TDAG learning element is used as follows.
First, the target program is changed to an equivalent
program in which each clause [C]: H <- T1, T2, ...
is replaced by a pair of clauses:
[C1]: H <- Tail-T1, Tail-T2, ... and
[C2]: Tail-H <- Tail-T1, Tail-T2, .... For example, the
program in Fig. 2 is transformed as shown in Fig. 3.
This transformation helps to distinguish clauses used to
resolve the main goal from those used to resolve subgoals
and provides more context within the execution on which to
condition the code transformations.

For each input problem, the Prolog interpreter solves 
the problem while building a clause-name tree. 1 
Whenever a clause C is used to try to refute a goal, 
a measurement is made of the cost $C of applying that 
clause (say, by measuring CPU time or counting uni- 
fications) and refuting its subgoals. If the clause fails, 
the name of the failing clause and the cost of attempt- 
ing it are stored as data with the tree. If it succeeds, 
the name of the successful clause and the cost of find- 
ing the solution are stored in the node, and its child 
nodes are recursively constructed from the results of 
resolving its subgoals. Note that both success and fail- 
ure costs are accrued. 

Next, the tree is passed to the multi-string TDAG al- 
gorithm, one node at a time, in pre-order. In addition 
to storing the clause-names as symbols and counting 
their successors, we also count the total number of 
attempts (successful or otherwise) to use that clause 
and the total cost of all such attempts. The TDAG, 
therefore, contains enough information to predict the 
probability that each clause will successfully resolve a 
given subgoal and the expected cost of applying the 
clause. 
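
A minimal sketch of the per-node bookkeeping described above: each clause node accumulates attempts, successes, and total cost, from which a success probability and an expected cost per attempt follow. The class and method names are hypothetical and the cost unit is left open, as in the text.

```python
from dataclasses import dataclass

@dataclass
class ClauseStats:
    attempts: int = 0
    successes: int = 0
    total_cost: float = 0.0

    def record(self, succeeded, cost):
        """Record one attempt to use the clause, successful or not."""
        self.attempts += 1
        self.successes += int(succeeded)
        self.total_cost += cost

    def success_probability(self):
        return self.successes / self.attempts if self.attempts else None

    def expected_cost(self):
        return self.total_cost / self.attempts if self.attempts else None

stats = ClauseStats()
for succeeded, cost in [(True, 4.0), (False, 2.0), (True, 6.0)]:
    stats.record(succeeded, cost)
print(stats.success_probability(), stats.expected_cost())
```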

1 For our implementation second-order program elements
such as negation-by-failure and call were allowed,
but these structures appeared as leaf nodes in the clause-name
tree, without any analysis of their proof structure.
Non-logical constructs like cuts were not allowed.

[CP1]: p(a).                               CP2
[CP2]: p(f(X)) <- q(h(X)), p(X).          /   \
[CQ1]: q(h(X)).                         CQ1   CP1
[CQ2]: q(b).

Figure 2: A simple clause-name tree. The program is shown on the left with clause labels in square brackets.
To the right is the clause-name tree for the proof of the goal p(f(a)).

[C1]: p(a).
[C2]: p(f(X)) <- tail-q(h(X)), tail-p(X).
[C3]: tail-p(a).
[C4]: tail-p(f(X)) <- tail-q(h(X)), tail-p(X).
[C5]: q(h(Y)).
[C6]: q(b).
[C7]: tail-q(h(Y)).
[C8]: tail-q(b).

Figure 3: Initial transformation of the program in Figure 2.

As more input problems are solved and the resulting
clause-name tree statistics are passed to the TDAG,
the accuracy of the information increases. Unfortunately,
without strong assumptions about the problem
source, there is no theoretically justified way to compute
the number of input problems needed to guarantee
that the TDAG will achieve a given level of accuracy.
The practical method I used was to feed the
TDAG some number m of problem results and compare
the results to those of a TDAG built from 3m/2
problem results; if the prediction probabilities differed
significantly, I increased the sample size. The largest
number of problems I needed for learning was 300, so
convergence is reasonably fast.
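
The stopping test for the learning phase (comparing a TDAG built from m results with one built from 3m/2 results) might be sketched as follows. The source does not say how "differed significantly" was decided, so the absolute tolerance of 0.05 and the function name are purely assumptions of this sketch.

```python
def predictions_stable(p_small, p_large, tolerance=0.05):
    """Compare prediction probabilities from an m-sample TDAG (p_small)
    with those from a 3m/2-sample TDAG (p_large).

    tolerance is an assumed threshold; the paper gives no specific value.
    """
    names = set(p_small) | set(p_large)
    return all(
        abs(p_small.get(name, 0.0) - p_large.get(name, 0.0)) <= tolerance
        for name in names
    )

# Hypothetical probabilities for the two tail-q clauses.
print(predictions_stable({"C7": 0.70, "C8": 0.30}, {"C7": 0.72, "C8": 0.28}))
```

If the test fails, the sample size m is increased and the comparison is repeated.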

Summarizing, each node of the TDAG tree contains 
the name C of a clause, the number of attempts to 
satisfy a goal using that clause, the number of suc- 
cessful attempts, the total cost of those attempts, and 
the usual TDAG likelihoods for each of its subgoals. 
In Fig. 4 we show the structure of a possible TDAG 
resulting from executions of the program in Fig. 3, as- 
suming that “p” is the predicate of the main goal. The 
root node has two successors C1 and C2, the clause
names for predicate p. C1 is a leaf because clause
C1 has no antecedents. C2 has two subtrees, one for
each of its two antecedents. The “tail-q” subtree has 
two clauses (C7 and C8) as children. The probability 
p(C7) (not shown) estimates the likelihood that clause 
C7 will successfully resolve the first subgoal. The cost 
$C7 estimates the expected cost of using C7 to refute 
the first subgoal of C2 in this context. (Similarly for 
C8.) 

The other subtree of C2 has two children, C3 and 
C4, whose statistics apply to the second antecedent 
(“tail-p”) of C2. Clause C4 is expected to have two 
further subtrees below it, corresponding to the two an- 
tecedents in that clause. 


5 The Optimizer Algorithm 

In this implementation the optimizer has available two 
program transformations: 

• Clause reordering: Change the order in which the
  clauses for a predicate p are attempted. For example,
  to solve the tail-q subgoal of clause C2
  in Fig. 3, clause C7 will be tried before C8 by
  virtue of its position in the program. To reverse
  this ordering (but without affecting other calls to
  tail-q) we first change clause C2 as follows:

  [C2']: p(f(X)) <- g218(h(X)), tail-p(X).

  (g218 is a new predicate symbol) and add these
  new clauses:

  [C9]: g218(b).
  [C10]: g218(h(Y)).

  Clauses C7 and C8 are unaffected. The new predicate
  g218 has the same semantics as tail-q except
  for the order of its clauses.

• Unfolding: Resolve an antecedent of one clause
  with the head of another, resulting in a new
  clause. For example, if clause C3 is the most likely
  choice for solving the tail-p subgoal in clause C2,
  we can replace C2 by the following two clauses:

  [C2.1]: p(f(a)) <- tail-q(h(a)).
  [C2.2]: p(f(X)) <- tail-q(h(X)), g777(X).

  and add the clause:

  [C11]: g777(f(X)) <- tail-q(h(X)), tail-p(X).

  Clause C1 remains unchanged. The new procedure
  g777 is derived from tail-p but omits the
  clause already unfolded into C2.

Figure 4: Sample TDAG structure.

Decisions about which transformations to apply and
where are based on the TDAG data collected during
the learning phase. Referring to Fig. 4, suppose
that in the TDAG there is a C2 node whose first
subgoal (“tail-q”) has clauses C7 and C8 as children,
with estimates for p(C7), $C7, p(C8), and $C8,
respectively. By a well-known result, the optimum ordering
for these two clauses is in decreasing order of the quantity
\(p(C_i)/\$C_i\). Thus we can quickly find applicable
clause reorderings from the information in the TDAG.
Note that, since clause C8 is attempted only on instances
where clause C7 fails, the estimate for p(C8)
is actually an estimate for \(p(C8 \mid \neg C7)\); hence the true
optimum ordering may not always be chosen.
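
The ordering rule is easy to state in code. In this sketch each clause maps to a pair (estimated success probability, estimated cost); the function name and the numbers are hypothetical, and costs are assumed to be strictly positive.

```python
def order_clauses(stats):
    """Return clause names in decreasing order of p / cost.

    stats maps a clause name to (success probability, expected cost).
    """
    return sorted(stats, key=lambda name: stats[name][0] / stats[name][1],
                  reverse=True)

# Hypothetical estimates for the tail-q subgoal of C2.
order = order_clauses({"C7": (0.2, 4.0), "C8": (0.6, 3.0)})
print(order)
```

Here C8 has ratio 0.2 and C7 has ratio 0.05, so the sketch recommends trying C8 first, which is the reordering implemented by the g218 example above.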

The analysis of unfolding transformations is more dif- 
ficult, and most easily described by example. The 
utility of unfolding comes partly from the economy of
combining steps and partly from changing the order
in which subgoals are resolved with their clauses. In
Fig. 3, for example, when clause C2 is invoked and
its two subgoals are resolved, the clauses are tried 
in the following order: (C7,C3), (C7,C4), (C8,C3), 
(C8,C4). But if clause C3 is unfolded into the second 
subgoal as in the above example, this order becomes: 
(C7,C3), (C8,C3), (C7,C4), (C8,C4). The risk is 
that unification costs will increase when all subgoals 
fail since there is an additional clause in the procedure. 
Whether the unfolding will improve the expected cost 
of the program is important to predict with high confi- 
dence, since (unlike clause reorderings) unfoldings can- 
not be undone later by our optimizer. 

Consider the unfolding example above, where C3 is
unfolded into the second antecedent of C2. After the
unfolding, the TDAG structure of Fig. 4 will change
to that shown in Fig. 5. As a result, the expected cost
$Root of the root node will also change. In both cases
the statistics of C1 play an identical role, so we can
ignore this clause in the calculations. Before unfolding,
the expected cost of clause C2 is $C2, a measured
quantity. After unfolding, the expected cost of the
C2.1 and C2.2 clauses is $C2.1 + (1 - p(C2.1)) $C2.2.
The values of the likelihood p(C2.1) and the costs
$C2.1 and $C2.2 are, of course, not known quantities,
but they can be predicted approximately from
available TDAG measurements.
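
The cost comparison can be written out directly from the formula above. This is a sketch only: the function names are hypothetical, the inputs are assumed to be the predicted values of $C2.1, p(C2.1), and $C2.2, and the numbers in the example are invented for illustration.

```python
def unfolded_cost(cost_first, p_first, cost_fallback):
    """Expected cost $C2.1 + (1 - p(C2.1)) * $C2.2 of the unfolded pair."""
    return cost_first + (1.0 - p_first) * cost_fallback

def unfolding_lowers_cost(cost_before, cost_first, p_first, cost_fallback):
    """True if the predicted cost after unfolding is below the measured $C2."""
    return unfolded_cost(cost_first, p_first, cost_fallback) < cost_before

# Invented values: $C2 = 10, $C2.1 = 4, p(C2.1) = 0.5, $C2.2 = 9.
after = unfolded_cost(4.0, 0.5, 9.0)
print(after, unfolding_lowers_cost(10.0, 4.0, 0.5, 9.0))
```

Because the fallback clause is paid for only when the unfolded clause fails, a high p(C2.1) is what makes an unfolding attractive.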


Let us illustrate with the case of p(C2.1). (1 - p(C2.1))
is the probability that clause C2.1 fails. This can happen
if the head of the clause does not unify with the
p(...) goal, or if the antecedent (tail-q) fails. C2.1
will fail to unify with the goal if either the more general
clause C2 will not unify or if C2 unifies only to
have the subgoal C3 fail. Both these likelihoods can
be estimated from the TDAG statistics: p(C3) is a
measured quantity; and if C2 was attempted, say, 100
times and C7 only 75 times, then we infer that 25 times
out of 100 the clause C2 failed due to non-unification
of the head with the subgoal, so the probability that
C2 fails to unify is 0.25. Similarly we can infer from
the TDAG statistics the likelihood that the first antecedent
of C2.1 (tail-q) fails. This antecedent is
stronger than the tail-q antecedent of C2 since the
substitution X = a has been applied to it. In Fig. 4, if
we attempted C2 100 times, C7 75 times, and C3 only
10 times, then out of 75 times, we infer that the first
subgoal (tail-q) succeeded only 10 times; hence the
failure probability is about 65/75. Multiplying this by
the likelihood that C3 fails to unify gives us our estimate
of the likelihood that the tail-q subgoal of C2.1
fails.

Details apart, the point is that the TDAG statistics 
have the data necessary to compute the expected cost 
of solving the goal after the unfolding and to predict 
whether the unfolding will be beneficial. If the esti- 
mated cost of $Root with the unfolded clauses is lower 
than that without the unfolding, the optimizer goes 
ahead with it. The number of clauses for the unfolded 
predicate (p in this example) will increase by one; the 
fallback case — clause C2.2 in our example — must be 
present to preserve the semantics in case the unfolded 
clause (C2.1) fails. When there are more than two 
clauses for a predicate, deciding where in the list of 
clauses to place the fallback case is problematic. My 
approach was to place it last in the list of clauses and 
let clause reordering in the next pass determine its best 
position. 

6 The Learn-Optimize Cycle 

We have seen how the learning element collects statis- 
tics from program executions and incrementally builds 
a TDAG that can predict the optimal clause orderings 
at various points of the search and find advantageous 
unfolding transformations. The dynamic optimization 
process for a program is an alternating cycle of learn- 
ing and optimizing passes: learning from sample exe- 
cutions, then transforming the program, learning from 
sample executions of the transformed program, trans- 
forming again, and so forth. The cycle stops when the 
optimizer can recommend no further transformations. 
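
The control structure of the cycle can be sketched as a simple loop. The callables `learn` and `propose` are placeholders for the learning element and the optimizer, and the pass limit is an added safeguard; none of these names come from the original system.

```python
def learn_optimize_cycle(program, learn, propose, max_passes=50):
    """Alternate learning and optimizing passes until no transformation
    is proposed. Returns the final program and the number of passes that
    applied at least one transformation."""
    for passes in range(max_passes):
        stats = learn(program)                      # learning pass
        transformations = propose(program, stats)   # optimizing pass
        if not transformations:
            return program, passes
        for transform in transformations:
            program = transform(program)
    return program, max_passes

# Toy stand-ins: the "program" is a clause order, the "statistics" are fixed
# p/cost ratios, and the only transformation is a reordering.
def demo_learn(program):
    return {"C7": 0.05, "C8": 0.20}

def demo_propose(program, stats):
    best = sorted(program, key=lambda c: stats[c], reverse=True)
    return [] if program == best else [lambda _old: best]

result, passes = learn_optimize_cycle(["C7", "C8"], demo_learn, demo_propose)
print(result, passes)
```

In the toy run the first pass reorders the clauses and the second pass finds nothing further to do, so the cycle stops.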

Figure 5: TDAG structure of Fig. 4 after unfolding
clause C2.

Two policies govern the choice of transformations during
each cycle. First, clause reordering has priority
over unfolding transformations. If, according to the
TDAG, the clauses for a particular subgoal are not in
optimal order, and simultaneously one of the clauses is
a candidate for unfolding, then the optimizer will perform
only the clause reordering transformation. (An
exception is the case where a clause has likelihood one
in solving a subgoal; in this case, both the reordering
and the unfolding can be performed.) The reason
is that the reordering may affect the statistics used to
evaluate the potential unfolding transformation, so the
clauses should be in the right order before unfolding
any of them.

Second, priority is given to transformations at nodes
closest to the root. If a transformation is applied to a
node on one pass of the cycle, no descendants of that
node are transformed on the same pass. For example,
if in Fig. 4 we reorder clauses C1 and C2, then any
reordering of clauses C7 and C8 will have to wait until
the next pass, even if the TDAG statistics currently
recommend that C8 should be first.

The result of these two policies is that optimizations 
tend to occur deeper in the TDAG with each cycle, and 
thus the number of transformations performed tends 
to increase with each cycle. As noted above, several 
learn-optimize passes are used instead of one because a 
transformation at a TDAG node changes the statistics 
of the nodes below it and, as a result, the potential 
utility of any transformations at those deeper nodes. 

Each transformation increases the number of clauses in 
the Prolog program, and over the entire cycle the pro- 
gram size may increase several fold. This is not a prob- 
lem since program performance depends hardly at all 
on the size of the program. In the Prolog used for the 
experiments, clauses were retrieved from the database 
by a hash table indexed by the predicate functor. In 
some Prolog implementations, however, clauses are in- 
dexed by both the predicate functor and the leftmost 
functor of the first argument; in this case, the TDAG
nodes would likewise be labeled by the pair of functor
names, rather than by the predicate functor name
alone.

Recall that the cycle terminates when the optimizer 
finds no justifiable transformations in the TDAG. If 
h is the maximum height of the TDAG tree, and if 
the sample size of the learning phase is large enough, 
then with high likelihood 3/i is the maximum expected 
number of learn-optimize passes in the cycle. The fac- 
tor of 3 arises from the possibility that, at any node 
C, a reordering of its child clauses may occur on one 
pass, an unfolding on the next, and a further reorder- 
ing of the fallback clause from the unfolding on the 
one after that. On subsequent passes, transformations 
may occur at descendants of C but are not expected
at C. There is also a statistical chance that the proba- 
bility of success and the expected cost of attempting a 
clause may change a lot when the order of the clause is 
changed, so that a clause reordering may be reversed 
on the next pass. In my experiments, however, this 
did not occur; and in fact twelve passes were the most 
required for any program, compared to the theoretical 
limit of 21. 
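
The termination rule and the bound of three passes per level of TDAG height can be put together as a simple loop. The following Python sketch assumes three hypothetical callables (learn, propose, apply_all) that stand in for the learning element, the utility analysis, and the installation of transformations; none of these names comes from the system itself, and the pass cap is used here only as a safety net.

```python
from typing import Callable, List, Tuple


def learn_optimize_cycle(
    program: object,
    learn: Callable[[object], object],
    propose: Callable[[object, object], List[object]],
    apply_all: Callable[[object, List[object]], object],
    height: int,
) -> Tuple[object, int]:
    """Repeat learn/optimize passes until no transformation is justified.

    `learn` gathers statistics from a fresh sample, `propose` returns the
    justified transformations, and `apply_all` installs them. The cap of
    3 * height passes reflects the expected bound, not a hard guarantee.
    """
    max_passes = 3 * height
    passes = 0
    while passes < max_passes:
        stats = learn(program)
        transformations = propose(program, stats)
        if not transformations:
            break                      # normal termination
        program = apply_all(program, transformations)
        passes += 1
    return program, passes


# Toy stand-ins: the "program" is just a count of remaining improvements.
result = learn_optimize_cycle(
    5,
    learn=lambda prog: prog,
    propose=lambda prog, stats: ["t"] if stats > 0 else [],
    apply_all=lambda prog, ts: prog - 1,
    height=7,
)
print(result)  # (0, 5)
```

With height set to 7 the cap is 21 passes, which would match the limit quoted above if the tallest TDAG in the experiments had height 7.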

7 Experimental Results 

To evaluate this dynamic optimization method, I mod- 
ified a Prolog compiler so that the programs it compiles 
will construct their clause-name trees and collect the 
cost statistics as part of the search for a refutation of 
the input goal. (In the tests, I used both unification 
counts and CPU time as cost measures, with compa- 
rable results for the two measures.) After finding each 
solution to the input goal, the programs present the fi- 
nal clause-name tree to a multi-string TDAG learning 
element. More than ten Prolog programs of varying 
sizes were used to test the system. For each program, 
a problem generator was also constructed to provide a 
random set of problems for the (compiled) Prolog pro- 
gram to run. In most cases the generator was some- 
what skewed so that, instead of problems being gener- 
ated more or less uniformly by size, some regularities 
occurred with higher probability. In some cases, the
same program was run with different problem gener- 
ators to assess the sensitivity of the optimizer to the 
problem source. 

Next, the unoptimized program was put through a 
learn-optimize cycle. A sample size was empirically de- 
termined, as described above. This number was then 
used consistently by the learning element, although 
the actual set of problems changed on each pass. Af- 
ter compiling the program with the modified compiler 
and collecting statistics with the TDAG, useful trans- 
formations were identified and implemented. Then the 
optimized program was recompiled with the modified 
compiler, and the learn-optimize cycle continued. 

The task of locating and installing the optimizing 
transformations was done by hand with machine as-
sistance; moreover, the transformations were selected
and installed one at a time, rather than in batches. 
This procedure — which ordinarily would (and should) 
be entirely automatic — was adopted as a research tool 
to study in detail the performance of the optimizer and 
to verify whether each recommended transformation 
improved performance as predicted. 

Clause reordering transformations are easy to identify 
and install, but the procedure for finding and con- 
structing good unfolding transformations is slow and 
cumbersome. Even with machine assistance, testing 
every possible unfolding for its expected utility value 
took time, and I felt the need for a simpler rule that 
would suggest effective unfoldings more quickly. 

I shall describe in detail the results for one program: 
the familiar member predicate defining membership in 
a list. 

/* member(X, Y) <- X is in the list Y. */

[CM1]: member(X, X._).

[CM2]: member(X, Y.Z) :- member(X, Z).

This problem demonstrates quite well that costs, not 
just probabilities, must be considered during the op- 
timization; moreover, both unfoldings and clause re- 
orderings played important roles in its optimization. 

In this test the list Y was always a list of thirty differ- 
ent integers, and the first argument X matched exactly 
one of the integers in the list. The problem generator 
was constructed so that the target element X occurs 
exactly once in the list Y, at a position more or less uni- 
formly distributed between fourth and thirtieth in the 
list. In a stream of such problems, it is clear the clause
CM2 will be applied with much greater frequency than
CM1. If, therefore, we chose clauses solely by prob-
ability, CM2 would always be tried before CM1, and
the optimizer would reorder the clauses so that clause
CM2 precedes CM1. In such a program the proce-
dure would be first to go through the list to the end
and then backtrack, testing the target against each el- 
ement of the list in reverse order. By contrast, the 
TDAG determined that in most circumstances the ex- 
pected p/$C value of CM1 was about half that of CM2 
and declined to reorder these two clauses. 
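
The effect of the two clause orderings on this problem source can be checked with a small simulation. The Python sketch below is an illustration under stated assumptions, not the instrumentation used in the experiments: it counts one attempt per clause head tried, ignores any clause indexing, stops at the first solution, and draws the thirty integers from an arbitrary range. The function names are hypothetical.

```python
import random
from typing import List, Tuple


def make_problem(rng: random.Random) -> Tuple[int, List[int]]:
    """Thirty distinct integers; the target sits at position 4..30."""
    items = rng.sample(range(1000), 30)
    position = rng.randint(4, 30)          # 1-based, inclusive
    return items[position - 1], items


def attempts_cm1_first(x: int, ys: List[int]) -> int:
    """Clause-head attempts when CM1 is tried before CM2."""
    attempts = 0
    for y in ys:
        attempts += 1                      # try CM1 on this list cell
        if y == x:
            return attempts
        attempts += 1                      # CM1 failed, CM2 steps onward
    return attempts + 2                    # both clauses fail on the empty list


def attempts_cm2_first(x: int, ys: List[int]) -> int:
    """Clause-head attempts when CM2 is tried before CM1."""
    attempts = len(ys) + 2                 # CM2 down the list, then [] fails twice
    for y in reversed(ys):                 # backtrack, trying CM1 from the end
        attempts += 1
        if y == x:
            return attempts
    return attempts


rng = random.Random(0)
sample = [make_problem(rng) for _ in range(200)]
mean_cm1 = sum(attempts_cm1_first(x, ys) for x, ys in sample) / len(sample)
mean_cm2 = sum(attempts_cm2_first(x, ys) for x, ys in sample) / len(sample)
print(mean_cm1, mean_cm2)
```

Under this counting model a target at position k costs 2k - 1 attempts with CM1 first and 63 - k attempts with CM2 first, so for a target in the middle of the range (k = 17) the original ordering needs 33 attempts against 46, even though CM2 is the clause applied far more often.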

For this problem the sample size was 200 problem in- 
stances. Eight rounds of optimization were needed to 
produce the final program (given in the appendix) with 
eighteen clauses. 

Examining the optimized program, we note that 
member was unfolded so that the search for the tar- 
get element X begins with the fourth element of the 
list (clause M1). Clauses M2 through M4 will never be
needed with this problem generator, but the optimizer 
can’t know that and includes them for completeness. 
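
The control flow produced by this unfolding can be sketched in a few lines. The Python below is a deliberate simplification and an assumption on my part: it collapses the specialized predicates that M1 calls into a plain search of the remaining list, and keeps only the order in which parts of the list are examined. The function names are hypothetical.

```python
from typing import Sequence


def member(x: int, ys: Sequence[int]) -> bool:
    """Plain two-clause search: test each element in turn."""
    for y in ys:
        if y == x:
            return True
    return False


def member_unfolded(x: int, ys: Sequence[int]) -> bool:
    """Sketch of the unfolded program's search order.

    First skip three elements and search the rest (the role of M1);
    only if that fails, test the first three (the role of M2 to M4).
    """
    if len(ys) >= 3 and member(x, ys[3:]):
        return True
    return any(y == x for y in ys[:3])


items = list(range(10))
print(all(member_unfolded(x, items) == member(x, items) for x in range(-1, 11)))
```

The fallback on the first three elements is what keeps the unfolded program equivalent to the original on inputs that this problem generator never produces.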

The recursive predicate tail-member (M5 and M6) re-
mained unchanged throughout the cycle. Note, how-
ever, that member calls, not tail-member, but d151
and other newly created predicates as a result of clause
reorderings and unfoldings in the TDAG, before it fi-
nally calls tail-member in clause M14.

The three-clause predicate d151 is a copy of
tail-member that was produced by a reordering (plac-
ing M16 first) and an unfolding that skips over two
more elements of the list before resuming the search.
d165a and d165b are copies of tail-member and serve
only to provide context for d165c, for which the usual
clause ordering is reversed. This reordering, whose
utility was well justified by a reduction in average
runtime, was quite unexpected and apparently the re-
sult of an unintended statistical pattern in the problem
generator.

The program size grew from two clauses to eighteen af- 
ter three unfoldings and three reorderings. Five other 
clauses generated during the cycle ended up being un- 
referenced as a result of subsequent transformations 
and were therefore eliminated. Average costs for this 
generator were reduced by 18.5% for unifications and 
17.2% for CPU time. 
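
One simple way to find unreferenced clauses is a reachability check over the predicate call graph. The Python sketch below is an assumption about how such a check could be done, not a description of the procedure actually used. The call graph is read off the appendix listing as best it can be deciphered, and the predicate d999 is a made-up leftover added for illustration.

```python
from typing import Dict, List, Set


def reachable(calls: Dict[str, List[str]], entry: str) -> Set[str]:
    """Predicates reachable from `entry` in a predicate call graph."""
    seen: Set[str] = set()
    stack = [entry]
    while stack:
        pred = stack.pop()
        if pred in seen:
            continue
        seen.add(pred)
        stack.extend(calls.get(pred, []))
    return seen


# Call graph of the optimized program, plus one hypothetical orphan.
calls = {
    "member": ["d151", "d173"],
    "d151": ["d165a", "d173"],
    "d165a": ["d165b"],
    "d165b": ["d165c"],
    "d165c": ["d276"],
    "d276": ["tail-member"],
    "tail-member": ["tail-member"],
    "d173": [],
    "d999": ["tail-member"],   # hypothetical leftover predicate
}
live = reachable(calls, "member")
dead = sorted(set(calls) - live)
print(dead)  # ['d999']
```

Every predicate of the final eighteen-clause program is reachable from member, so only the made-up orphan is reported.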

Note, finally, that all these changes are truly dynamic 
optimizations: the member source code alone will not 
point to these changes as effective. The examples — 
specifically, the statistical properties of the examples— 
are essential for understanding the utility value of 
these transformations to the original two-clause pro- 
gram. 

Space will not permit describing all our other experi- 
ments, but let us mention a few highlights: 

• The implemented-by program used by several
researchers, e.g., (Shavlik, 1990; Subramanian
and Feldman, 1990), is notorious for produc-
ing generalization-to-N anomalies in explanation-
based generalization. In tests with three different
problem generators, the optimizer produced only
clause reorderings, never any unfoldings. Cost im-
provements ranged from about 4% for a
generator with little skew to about 25% for one
with strong skew.

• color is a brute-force graph-coloring algorithm 
using an unsophisticated backtracking search al- 
gorithm. Problems were generated more or less 
at random. No performance improvement was 
expected, and none was observed, despite sev- 
eral clause reorderings and one unfolding trans- 
formation. Significantly, however, no performance 
degradation was observed either. 

• The best cost improvements (about 40%)
occurred for a program that parses a context-free
language. The gains resulted mainly from unfold-
ings that took advantage of strong patterns in the
productions of the grammar.

Finally, the most striking feature of these experiments
was the robustness of the results: several runs of the 
cycle with the same program and generator (but a dif- 
ferent random seed for the generator) almost always 
resulted in the same sequence of optimizations. I had 
a strong sense that the optimizer was finding a local mini-
mum for the program’s runtime performance, and that
this minimum was not very sensitive to the particular 
sequence of the examples. This stands in contrast to 
reported results with EBL methods. 

8 Summary: the Learn-Optimize Model

Although the details of this work pertain to dynamic 
optimization of Prolog programs, the major idea ap- 
plies more generally: dynamic optimization should be- 
gin with a learning element that analyzes a sufficiently 
large sample for the learning to be reliable, followed 
by an optimization phase that bases its changes to the 
program on the results of the learning. 

By contrast, most previous efforts to apply EBL to 
program speedup have been based upon a caching 
method: the idea is to save the solutions to individ- 
ual problems and reuse them when the problems recur. 
The EBG algorithm generalizes the solution somewhat 
before caching it, but paradoxically the larger — and 
more informative — the solution, the weaker the gen- 
eralization (Laird and Gamble, 1990a). This is the 
basis for the “generalization-to-N” problems. With 
caching comes the necessity for utility estimation: be- 
cause of the limited space in the cache and limited 
time to search that space, only the most useful chunks 
can be retained. Caching also suffers from sensitivity 
to the order of arrival of the examples: because a deci- 
sion must be made at once whether to save or discard 
a solution, unrepresentative problems that occur early 
will slow down the program until they can be displaced 
by more typical chunks at some much later time. 

Our experiments exhibited no generalization-to-N 
problems and negligible sensitivity to the input order. 
No special “utility evaluation” has to be added on be- 
cause utility analysis is the very core of the method. 
The learning element learns what it needs to evaluate 
the potential program transformations. In our case, 
since the only two transformations were clause reorder- 
ing and unfolding, the statistics learned by the TDAG 
were those necessary to perform these transformations. 
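
To see why both probabilities and costs are needed for the reordering decision, consider a simplified model in which clauses are tried in order until one succeeds, each with an independent success probability and a fixed cost per attempt. This model, the numbers, and the function names in the Python sketch below are assumptions for illustration; the TDAG computes its estimates from observed executions and is not reproduced here.

```python
from itertools import permutations
from typing import Dict, Sequence, Tuple

Clause = Tuple[float, float]   # (success probability, cost of one attempt)


def expected_cost(order: Sequence[str], stats: Dict[str, Clause]) -> float:
    """Expected cost of trying clauses in `order` until one succeeds."""
    total = 0.0
    p_reach = 1.0                      # chance that all earlier clauses failed
    for name in order:
        p, cost = stats[name]
        total += p_reach * cost
        p_reach *= 1.0 - p
    return total


def best_order(stats: Dict[str, Clause]) -> Tuple[str, ...]:
    """Brute-force the cheapest ordering (fine for a handful of clauses)."""
    return min(permutations(stats), key=lambda o: expected_cost(o, stats))


stats = {"A": (0.8, 10.0), "B": (0.2, 1.0)}   # made-up numbers
print(best_order(stats))  # ('B', 'A')
```

In this example clause A succeeds four times as often as B, yet trying the cheap clause B first gives an expected cost of 9 instead of 10.2, so an ordering chosen by probability alone would be the wrong one.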

The optimization technique described here is not spe- 
cific to Prolog. The same basic method is directly 
applicable to any nondeterministic typed-term lan-
guage (Laird and Gamble, 1990b), including lambda-
calculus-based and combinator-based languages. The
relevant characteristic of these languages is that non- 
determinism is represented explicitly in the language; 
the determinism necessary for consistent computation
is provided by the underlying operational semantics
(e.g., SLD resolution in Prolog). By contrast, impera- 
tive languages like C probably would not benefit much 
from dynamic optimization, since the nondeterministic 
element (search) is not represented as such in the code, 
but instead is embedded in if-then-else constructs or 
procedure calls. 

Before adopting the learn-optimize cycle, I first tried 
the approach of modifying the Prolog interpreter to 
call the TDAG for search-control advice. Even when 
the TDAG always recommended the correct clause, the 
overhead of calling the learning element overwhelmed 
any cost savings, and I was never able to reduce the 
average CPU time below that of the unoptimized pro- 
gram running without the TDAG. Only then did I 
decide to use the TDAG, not for search control, but 
for guidance on program optimization. In effect, the 
method described in this paper compiles the search- 
control information into the program instead of calling 
for it at run time. 

The procedure described in this paper is a first at- 
tempt at dynamic optimization for Prolog and suf- 
fers from a number of weaknesses that are being ad- 
dressed by continuing research. The decision to pass 
the clause-name tree to the TDAG at the conclusion 
of a successful search means that only very limited 
failure statistics can be collected, and calls to higher- 
order predicates (like or and call) are not fully rep- 
resented in the TDAG statistics. Also, by evaluating 
clause probabilities in the order in which the clauses 
occur in the program, the learning algorithm could 
recommend a transformation on one cycle and undo 
it on the next. (This never occurred in the experi- 
ments, however.) But the most unsatisfactory aspect 
of this (and related) research is the lack of any charac- 
terization of how “optimal” the resulting program will 
be. In the learn-optimize cycle the changes to the pro- 
grams are based on a hill-climbing model: the program 
performance is expected to improve after each cycle, 
and optimization stops only when a local optimum is 
reached or TDAG size limits prevent further progress. 
There may be circumstances, however, where the truly 
optimum program can be reached only after transfor- 
mations that temporarily worsen its performance; in 
such cases no hill-climbing method will find such an 
optimum. 

In this paper we assume that the problem generator 
chooses problems independently from a distribution — 
i.e., the choice of the next input does not depend in 
any way upon previous inputs. One can imagine cir- 
cumstances where this assumption does not hold, and 
yet our method provides no way to carry state infor- 
mation from one problem to the next. The TDAG can 
easily retain state information from one problem to the 
next; not so easy, however, is incorporating this state 
information into the optimized program. 

Appendix: Optimized member program

[M1]:  member(Item, [X1,X2,X3 | Rest]) :-
           d151(Item, Rest).
[M2]:  member(Item, [Item | Rest]).
[M3]:  member(Item, [X | Rest]) :-
           d173(Item, Rest).
[M4]:  member(Item, [X1,X2 | Rest]) :-
           d173(Item, Rest).
[M5]:  tail-member(Item, [Item | Rest]).
[M6]:  tail-member(Item, [X | Rest]) :-
           tail-member(Item, Rest).
[M7]:  d173(Item, [Item | Rest]).
[M8]:  d165a(Item, [Item | Rest]).
[M9]:  d165a(Item, [X | Rest]) :-
           d165b(Item, Rest).
[M10]: d165b(Item, [Item | Rest]).
[M11]: d165b(Item, [X | Rest]) :-
           d165c(Item, Rest).
[M12]: d165c(Item, [X | Rest]) :-
           d276(Item, Rest).
[M13]: d165c(Item, [Item | Rest]).
[M14]: d276(Item, [X | Rest]) :-
           tail-member(Item, Rest).
[M15]: d276(Item, [Item | Rest]).
[M16]: d151(Item, [Item | Rest]).
[M17]: d151(Item, [X1,X2 | Rest]) :-
           d165a(Item, Rest).
[M18]: d151(Item, [X | Rest]) :-
           d173(Item, Rest).






